tests: diagnostic on release build #2177

Merged
samrose merged 16 commits into develop from sam/diag-90-cleanup-exit-123
May 30, 2026
@samrose samrose commented May 28, 2026

Purpose

Diagnostic-only PR for the amd64 stage-1 AMI build failure in ami-release-nix.yml.

The failing shape is:

  • amazon-ebssurrogate exits with "Script exited with non-zero exit status: 123"
  • failure only reproduces on amd64 builds
  • arm64 builds pass
  • visible failure location has moved between runs, so the last printed line is not reliable evidence of the actual failing command

This PR is intentionally temporary. The goal is to capture enough evidence to identify the root cause, then remove or reduce the diagnostics.

Current working theory

The latest evidence points away from a deterministic command failure in 90-cleanup.sh.

The strongest current theory is that the amd64 source build instance, SSH provisioner session, or bootstrap process is being killed or interrupted outside normal Bash control near the late Ansible / cleanup boundary. One plausible cause is memory pressure triggering OOM behavior; this is still a hypothesis, not proven.

Why this changed:

  • the EXIT trap added to the parent bootstrap did not fire in the latest amd64 failures
  • Ansible completed with failed=0
  • Packer immediately started source instance cleanup after the provisioner failure
  • the visible death point differs between amd64 matrix legs
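The first bullet carries most of the weight here: a Bash EXIT trap runs when the shell exits on its own, but not when the shell is killed with SIGKILL, which is what the kernel OOM killer sends. A minimal sketch of that difference, assuming a local Bash; run_child and the marker file are hypothetical names used only for this illustration:

```shell
#!/usr/bin/env bash
# Illustration only: run_child and the marker path are hypothetical.

run_child() {
  # $1 = marker file written by the child's EXIT trap
  # $2 = "normal" (child exits by itself) or "kill" (child is SIGKILLed)
  local marker=$1 mode=$2
  rm -f "$marker"
  bash -c '
    trap "echo trap-fired > \"$1\"" EXIT
    if [ "$2" = kill ]; then
      kill -KILL $$        # stand-in for an OOM kill or other external termination
    fi
    exit 3
  ' _ "$marker" "$mode"
  echo "child status: $?"
}
```

With mode "normal" the marker file is written and the status is 3. With mode "kill" no marker appears and the parent sees 137 (128 + 9), which matches the shape of a trap that never prints.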

What has been tried

Round 1: instrument 90-cleanup.sh

Added tracing and in-chroot logging to scripts/90-cleanup.sh:

set -x
exec > >(tee -a /tmp/90-cleanup.log) 2>&1
trap 'echo "[90-cleanup] EXIT $? at line $LINENO: $BASH_COMMAND" >&2' ERR

Wrapped the parent chroot /mnt /tmp/90-cleanup.sh call in ebssurrogate/scripts/surrogate-bootstrap-nix.sh to capture cleanup_rc and tail /mnt/tmp/90-cleanup.log before re-exiting.
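The parent-side wrapper described above can be sketched roughly as follows. This is not the code from surrogate-bootstrap-nix.sh; the function name run_cleanup_with_log, the log prefix, and the 50-line tail are assumptions made for illustration:

```shell
#!/usr/bin/env bash
# Sketch only: function name, prefix, and tail length are hypothetical.

run_cleanup_with_log() {
  # $1 = log file to tail afterwards; remaining args = the command to run
  local log=$1; shift
  local cleanup_rc=0
  "$@" || cleanup_rc=$?            # capture the status without tripping set -e
  echo "[bootstrap] cleanup_rc=${cleanup_rc}"
  if [ -r "$log" ]; then
    echo "[bootstrap] last lines of ${log}:"
    tail -n 50 "$log"
  fi
  return "$cleanup_rc"             # propagate the original status to the caller
}

# In the bootstrap this would be invoked along the lines of:
#   run_cleanup_with_log /mnt/tmp/90-cleanup.log chroot /mnt /tmp/90-cleanup.sh
```

A wrapper like this only helps if the parent shell is still alive after the chroot call returns, which is why the absence of the log-tail output in the run below is itself a data point.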

Result from workflow run 26598216064, job 78374796167:

  • chrooted cleanup started
  • output reached exec, tee, ERR trap setup, and the /tmp permission check
  • output then cut off mid-xtrace around the chmod 1777 /tmp area
  • the cleanup ERR trap did not print
  • the parent log-tail block did not print

Initial suspicion was chmod 1777 /tmp, but later runs weakened that.

Round 2: pre-chroot chmod isolation test

Added a plain pre-cleanup chroot test that ran /bin/chmod 1777 /tmp via chroot /mnt /bin/bash -c ..., without the 90-cleanup.sh tee/exec wrapper.

Result from workflow run 26633420910, job 78487682887:

  • parent shell printed only part of the multi-line bash -c argument
  • no visible output from the chrooted bash body
  • set +e did not allow the parent shell to continue to the pre_rc echo

This invalidated the simple theory that chmod 1777 /tmp itself was the root cause.

Round 3: chroot micro-tests

Added five micro-tests before cleanup to isolate:

  • /mnt rootfs binary presence
  • chroot /bin/echo
  • script-file chroot execution
  • single-line bash -c
  • multi-line bash -c redirected to a host-side file

These were later removed because the failure pattern had already shifted and the diagnostics were adding noise without reliably surviving the failure.
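For reference, micro-tests of this kind can be written as small probes that record each exit status to a host-side file and never abort the script. The following is a sketch under assumptions, not the removed code: the probe helper, the PROBE_LOG path, and the example invocations in the comments are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch only: probe and PROBE_LOG are hypothetical names.

PROBE_LOG=${PROBE_LOG:-/tmp/chroot-probes.log}   # host-side file, outside the chroot

probe() {
  # $1 = label; remaining args = command to run.
  # Command output goes to the host-side log; the result line is also echoed.
  local label=$1; shift
  local rc=0
  "$@" >>"$PROBE_LOG" 2>&1 || rc=$?
  printf '[probe] %s rc=%s\n' "$label" "$rc" | tee -a "$PROBE_LOG"
  return 0                                       # never fail the caller
}

# Possible use in the bootstrap (needs root and a populated /mnt):
#   probe rootfs-binaries test -x /mnt/bin/bash
#   probe chroot-echo     chroot /mnt /bin/echo ok
#   probe single-line     chroot /mnt /bin/bash -c 'chmod 1777 /tmp'
```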

Round 4: parent EXIT trap

Added a parent-shell EXIT trap in surrogate-bootstrap-nix.sh intended to fire no matter where Bash exited. It captured:

  • memory
  • disk usage
  • inode usage
  • kernel-message tail
  • Ansible log tail
  • cleanup log tail

Also removed the round-2/round-3 micro-tests.
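A parent EXIT trap of the kind described in this round might look like the sketch below. The handler name dump_diagnostics and the ANSIBLE_LOG / CLEANUP_LOG variables are assumptions; the actual log paths used by the bootstrap are not stated here:

```shell
#!/usr/bin/env bash
# Sketch only: dump_diagnostics, ANSIBLE_LOG and CLEANUP_LOG are hypothetical.

dump_diagnostics() {
  local rc=$?                              # status the shell is exiting with
  echo "[bootstrap] EXIT rc=${rc}"
  free -m 2>/dev/null || true              # memory
  df -h / 2>/dev/null || true              # disk usage
  df -i / 2>/dev/null || true              # inode usage
  dmesg 2>/dev/null | tail -n 20 || true   # kernel-message tail (may need root)
  local f
  for f in "${ANSIBLE_LOG:-}" "${CLEANUP_LOG:-}"; do
    if [ -n "$f" ] && [ -r "$f" ]; then
      tail -n 20 "$f"                      # Ansible / cleanup log tails
    fi
  done
  return 0
}

# Installed once near the top of the parent script:
#   trap dump_diagnostics EXIT
```

Every command in the handler is guarded so that a missing tool cannot mask the original exit status. The handler still depends on Bash getting a chance to run it, so it cannot report a SIGKILL or a dropped session.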

Latest run checked: 26652852736.

Results:

  • all three amd64 jobs failed:
    • orioledb-17 amd64: job 78555585948
    • PG17 amd64: job 78555585990
    • PG15 amd64: job 78555585994
  • all arm64 jobs passed
  • the parent EXIT trap did not print in the amd64 failures
  • Ansible completed successfully with failed=0
  • visible death points differed:
    • PG17 died after Ansible output, before visible update_systemd_services
    • PG15 died while copying 90-cleanup.sh
    • orioledb-17 died after entering 90-cleanup.sh / tee

This strongly suggests the failure is outside ordinary Bash error handling.

Ruled out so far

  • Supascan baseline validation as the actual amd64 failure point. It is skipped on amd64.
  • A deterministic chmod 1777 /tmp failure.
  • Commit #2162 as the sole trigger; an amd64 build succeeded after that commit.
  • AMI snapshot / instance cleanup work as the cause of this provisioner failure.
  • A normal Ansible task failure in the latest run; Ansible reported failed=0.

Next local diagnostic

The next local diagnostic change is scoped to ARCH=amd64 only, so it applies across PG15, PG17, and orioledb-17 amd64 builds without changing arm64 behavior.

It adds:

  • active 16 GiB build-time swap on /mnt/tmp/build-swapfile
  • sanitized periodic watchdog output:
    • free -h
    • swapon --show
    • df -h / df -i for key mounts
    • top memory process names only, not full command lines
  • diagnostic override of OOM panic behavior:
    • vm.panic_on_oom=0
    • kernel.panic=0
  • amd64-only phase markers around:
    • execute_playbook
    • update_systemd_services
    • clean_system
  • sanitized parent EXIT trap output, avoiding raw dmesg, raw Ansible logs, and raw cleanup logs in GitHub output

The intent is to test the memory-pressure/OOM hypothesis without printing likely-sensitive console/log contents into GitHub Actions.
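The pieces listed above could be wired together roughly as follows. This is a sketch under assumptions rather than the change itself: the function names, the 30-second interval, and the log prefixes are hypothetical, and the two root-only functions are defined but not called:

```shell
#!/usr/bin/env bash
# Sketch only: function names, interval, and prefixes are hypothetical.

SWAPFILE=/mnt/tmp/build-swapfile     # path named in the description above

enable_build_swap() {                # needs root
  dd if=/dev/zero of="$SWAPFILE" bs=1M count=16384 status=none   # 16 GiB
  chmod 600 "$SWAPFILE"
  mkswap "$SWAPFILE" >/dev/null
  swapon "$SWAPFILE"
}

relax_oom_panic() {                  # needs root
  sysctl -w vm.panic_on_oom=0        # invoke the OOM killer instead of panicking
  sysctl -w kernel.panic=0           # do not auto-reboot after a panic
}

watchdog_snapshot() {
  echo "[watchdog] $(date -u +%H:%M:%SZ)"
  free -h 2>/dev/null || true
  swapon --show 2>/dev/null || true
  df -h / /mnt 2>/dev/null || true
  df -i / /mnt 2>/dev/null || true
  # RSS and process name (comm) only; full command lines are deliberately omitted.
  ps -e -o rss= -o comm= --sort=-rss 2>/dev/null | head -n 5 || true
}

start_watchdog() {                   # background heartbeat every $1 seconds
  local interval=${1:-30}
  while sleep "$interval"; do watchdog_snapshot; done &
  WATCHDOG_PID=$!
}

phase() {
  # $1 = phase name; remaining args = command. Markers print on amd64 only.
  local name=$1; shift
  local rc=0
  [ "${ARCH:-}" = amd64 ] && echo "[phase] begin ${name}"
  "$@" || rc=$?
  [ "${ARCH:-}" = amd64 ] && echo "[phase] end ${name} rc=${rc}"
  return "$rc"
}

# Possible use: phase execute_playbook execute_playbook
```

Using comm rather than args in the ps call is what keeps command-line arguments, which can contain secrets, out of the GitHub Actions log.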

What to look for next

  • If amd64 passes with active swap and OOM panic disabled, memory pressure is strongly implicated.
  • If watchdog output shows memory or swap exhaustion before failure, focus on the high-RSS process list around the last heartbeat.
  • If failure persists with no watchdog warning and no EXIT trap, look next at SSH/provisioner/session termination or external runner/build-instance behavior.
  • If failure becomes visible as a killed process instead of a silent exit status 123, OOM behavior is likely confirmed as the failure class.
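When reading the next run, it may help to separate ordinary exit statuses from signal deaths. By shell convention a status above 128 means the process was terminated by signal (status minus 128), so a SIGKILL shows up as 137, while the currently observed 123 is an ordinary non-zero exit as far as the reporting shell can tell. A small helper, with the hypothetical name classify_status:

```shell
#!/usr/bin/env bash
# Sketch only: classify_status is a hypothetical helper.

classify_status() {
  # $1 = numeric exit status as seen by the parent shell
  local rc=$1
  if [ "$rc" -eq 0 ]; then
    echo "success"
  elif [ "$rc" -gt 128 ]; then
    echo "terminated by signal $((rc - 128))"   # e.g. 137 -> signal 9 (SIGKILL)
  else
    echo "exited with status ${rc}"
  fi
}
```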

samrose requested review from a team as code owners May 28, 2026 19:12
samrose force-pushed the sam/diag-90-cleanup-exit-123 branch 3 times, most recently from 35f7f95 to 237e4a5 May 29, 2026 10:57
samrose force-pushed the sam/diag-90-cleanup-exit-123 branch from 1550b0e to 9a8f8e9 May 30, 2026 12:02
samrose enabled auto-merge May 30, 2026 12:15
samrose added this pull request to the merge queue May 30, 2026
Merged via the queue into develop with commit 2c2a7e2 May 30, 2026
649 checks passed
samrose deleted the sam/diag-90-cleanup-exit-123 branch May 30, 2026 13:02